Abstract. Parallel corpora are a valuable resource for machine translation,
but at present their availability and utility is limited by genre- and
domain-specificity, licensing restrictions, and the basic difficulty of
locating parallel texts in all but the most dominant of the world's
languages. A parallel corpus resource not yet explored is the World Wide
Web, which hosts an abundance of pages in parallel translation, offering a
potential solution to some of these problems and unique opportunities of
its own. This paper presents the necessary first step in that exploration:
a method for automatically finding parallel translated documents on the
Web. The technique is conceptually simple, fully language independent,
and scalable, and preliminary evaluation results indicate that the method
may be accurate enough to apply without human intervention.

In recent years large parallel corpora have taken on an important role as resources
in machine translation and multilingual natural language processing, for
such purposes as lexical acquisition (e.g. Gale and Church, 1991a; Melamed, 
1997), statistical translation models (e.g. Brown et al., 1990; Melamed 1998), 
and cross-language information retrieval (e.g. Davis and Dunning, 1995; Lan- 
dauer and Littman, 1990; also see Oard, 1997). However, for all but relatively 
few language pairs, parallel corpora are available only in relatively specialized 
forms such as United Nations proceedings (LDC, 1996), religious texts (Resnik, 
Olsen, and Diab, 1998), and localized versions of software manuals (Resnik and
Melamed, 1997). Even for the top dozen or so majority languages, the available 
parallel corpora tend to be unbalanced, representing primarily governmental and 
newswire-style texts. In addition, like other language resources, parallel corpora 
are often encumbered by fees or licensing restrictions. For all these reasons, fol- 
lowing the "more data are better data" advice of Church and Mercer (1993), 
abandoning balance in favor of volume, is difficult. 

A parallel corpus resource not yet explored is the World Wide Web, which 
hosts an abundance of pages in parallel translation, offering a potential solution 
to some of these problems and some unique opportunities of its own. The Web 
contains parallel pages in many languages, by innumerable authors, in multiple
genres and domains, and its content is continually enriched by language change
and modified by cultural context. In this paper I will not attempt to explore
whether such a free-wheeling source of linguistic content is better or worse than
the more controlled parallel corpora in use today.

Fig. 1. The STRAND architecture

Rather, this paper presents the necessary first step in that exploration: a 
method for automatically finding parallel translated documents on the Web 
that I call STRAND (Structural Translation Recognition for Acquiring Natural 
Data). The technique is conceptually simple, fully language independent, and 
scalable, and preliminary evaluation results indicate that the method may be 
accurate enough to apply without human intervention. 

In Section 2 I lay out the STRAND architecture and describe in detail the 
core of the method, a language-independent structurally based algorithm for 
assessing whether or not two Web pages were intended to be parallel translations. 
Section 3 presents preliminary evaluation, and Section 4 discusses future work. 



2 The STRAND Architecture 

As Figure 1 illustrates, the STRAND architecture is a simple pipeline. Given 
a particular pair of languages of interest, a candidate generation module first 
generates pairs (url1, url2) identifying World Wide Web pages that may be parallel
translations. 1 Next, a language-independent candidate evaluation module
behaves as a filter, keeping only those candidate pairs that are likely to actu- 
ally be translations. Optionally, a third module for language-dependent filtering 
applies additional filtering criteria that might depend upon language-specific re- 
sources. The end result is a set of candidate pairs that can reliably be added to 
the Web-based parallel corpus for these two languages. 

The approach to candidate evaluation taken in this paper has a useful side
effect: in assessing the likelihood that two HTML documents are parallel
translations, the module produces a segment-level alignment for the document pair,
where segments are chunks of text appearing in between markup. Thus STRAND
has the potential of producing a segment-aligned parallel corpus rather than, or
in addition to, a document-aligned parallel corpus. In this paper, however, only
the quality of document-level alignment is evaluated. 2

1 A URL, or uniform resource locator, is the address of a document or other resource
on the World Wide Web.

2.1 Candidate Generation 

At present the candidate generation module is implemented very simply. First, 
a query is submitted to the Altavista Web search engine, which identifies Web 
pages containing at least one hyperlink where 'language1' appears in the text
or URL associated with the link, and at least one such link for language2. 3 For
example, Altavista's "advanced search" can be given Boolean queries in this
form:

anchor:"language1" AND anchor:"language2"

A query of this kind, using english and french as language1 and language2,
respectively, locates the home page of the Academy of American & British English,
at http://www.academyofenglish.com/ (Figure 2), among many others.

On some pages, images alone are used to identify alternative language ver- 
sions — the flag of France linking to a French-language page, for example, but 
without the word "French" being visible to the user. Text-based queries can still 
locate such pages much of the time, however, because the HTML markup for 
the page conventionally includes the name of the language for display by non- 
graphical browsers (in the ALT field of the IMG element). Names of languages 
sometimes also appear in other parts of a URL — for example, the file containing 
the image of the French flag might be named french.gif. The Altavista query 
above succeeds in identifying all these cases and numerous others. 

In the second step of candidate generation, each page returned by Altavista
is automatically processed to extract all pairs (url1, url2) appearing in anchors
(a1, a2) such that a1 contains 'language1', a2 contains 'language2', and a1 and a2
are no more than 10 lines apart in the HTML source for the page. This distance
criterion captures the fact that for most Web pages that point off to parallel
translations, the links to those translations appear relatively close together, as
is the case in Figure 2.
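
The second step can be illustrated with a minimal sketch. It assumes Python's standard html.parser module rather than whatever tooling the prototype used, and it treats the line on which an anchor's start tag appears as the anchor's position. The names AnchorCollector and candidate_pairs are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (line_number, href, searchable_text) for every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished anchors
        self._open = None   # [line, href, text parts] of the anchor being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            line, _ = self.getpos()
            # The URL itself is searched too, since it may name the language.
            self._open = [line, attrs["href"], [attrs["href"]]]
        elif tag == "img" and self._open is not None:
            # ALT text and image file names often name the language.
            self._open[2].extend([attrs.get("alt") or "", attrs.get("src") or ""])

    def handle_data(self, data):
        if self._open is not None:
            self._open[2].append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            line, href, parts = self._open
            self.anchors.append((line, href, " ".join(parts).lower()))
            self._open = None


def candidate_pairs(html, language1, language2, max_lines=10):
    """Return (url1, url2) pairs whose anchors mention language1 and
    language2 and start no more than max_lines apart in the HTML source."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    name1, name2 = language1.lower(), language2.lower()
    pairs = []
    for line1, url1, text1 in collector.anchors:
        if name1 not in text1:
            continue
        for line2, url2, text2 in collector.anchors:
            if name2 in text2 and abs(line1 - line2) <= max_lines:
                pairs.append((url1, url2))
    return pairs
```

Like the method described above, this sketch can return pairs in which url1 and url2 are identical; such pairs would have to be discarded later.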

I have not experimented much with variants on this simple method for
candidate generation, and it clearly could be improved in numerous ways to
retrieve a greater number of good candidates. For example, it might make
sense to issue a query seeking documents in language2 with an anchor
containing 'language1' (e.g. query Altavista for pages in French containing
pointers to 'English', to capture the many pairs connected by a link saying
'English version'). Or, it might be possible to exploit parallel URL and/or
directory structure; for example, the URLs http://amta98.org/en/program.html and
http://amta98.org/fr/program.html are more likely than other URL pairs to be
referring to parallel pages, and the directory subtrees under en and fr on the
fictitious amta98.org server might be well worth exploring for other potential
candidate pairs.

2 HTML, or hypertext markup language, is currently the authoring language for most
Web pages. The STRAND approach should also be applicable to SGML, XML, and
other formats, but they will not be discussed here.

3 An "anchor" is a piece of HTML document that encodes a hypertext link. It typically
includes the URL of the page being linked to and text the user can click on to go
there; it may contain other information, as well. The URL for Altavista's "advanced
search" page is http://altavista.digital.com/cgi-bin/query?pg=aq&what=web.

Welcome to our new site!
This web site is available in:

English Spanish
Japanese Portuguese

Fig. 2. A page containing links to parallel translations

For this initial investigation, however, generating a reasonable set of can- 
didates was the necessary first step, and the simple approach above works well 
enough. Alternatives to the current candidate generation module will be explored 
in future work. 
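
As an illustration of the parallel-URL idea mentioned above, the following sketch proposes a partner URL by swapping one directory name for another. It is not part of the prototype described in this paper. The function name swap_language_segment is hypothetical, and the directory names en and fr are taken from the fictitious amta98.org example.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_language_segment(url, code1, code2):
    """Return url with the first path segment equal to code1 replaced by
    code2, or None if no path segment equals code1."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if code1 not in segments:
        return None
    segments[segments.index(code1)] = code2
    return urlunsplit(parts._replace(path="/".join(segments)))


# Propose a French partner for an English page on the fictitious server.
partner = swap_language_segment("http://amta98.org/en/program.html", "en", "fr")
```

A proposed partner is only a candidate: it would still need to be retrieved and passed through candidate evaluation.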



2.2 Candidate Evaluation 

The core of the STRAND approach is its method for evaluating candidate pairs 
— that is, determining whether two pages should be considered parallel trans- 
lations. This method exploits two facts. First, parallel pages are filled with a 
great deal of identical HTML markup. Second, work on bilingual text alignment 
has established that there is a reliably linear relationship in the lengths of text 
translations (Gale and Church, 1991b; Melamed, 1996). The algorithm works by 
using pieces of identical markup as reliable points of correspondence and com- 
puting a best alignment of markup and non-markup chunks between the two 
documents. It then computes the correlation for the lengths of the non-markup 
chunks. A test for the significance of this correlation is used to decide whether 
or not a candidate pair should be identified as parallel text. 

Fig. 3. Example of a candidate pair



For example, Figure 3 shows fragments from a pair of pages identified by 
STRAND's candidate generation module in the experiment to be described in 
Section 3. An English page is at left, Spanish at right. 4 Notice the extent to 
which the page layout is parallel, and the way in which corresponding units of 
text — list items, for example — have correspondingly greater or smaller lengths. 

In more detail, the steps in candidate evaluation are as follows: 



1. Linearize. Both documents in the candidate pair are run through a markup 
analyzer that acts as a transducer, producing a linear sequence containing 
three kinds of token: 



[START:element_label] e.g. [START:A], [START:LI]
[END:element_label] e.g. [END:A]
[Chunk:length] e.g. [Chunk:174]

2. Align the linearized sequences. There are many approaches one can take to 
aligning sequences of elements. In the current prototype, the Unix sdiff utility 
does a fine job of alignment, matching up identical START and END tokens 
in the sequence and Chunk tokens of identical length in such a way as to 
minimize the differences between the two sequences. For example, consider 
two documents that begin as follows: 



<HTML>
<TITLE>Emergency Exit</TITLE>
<BODY>
<H1>Emergency Exit</H1>
If seated at an exit and

<HTML>
<TITLE>Sortie de Secours</TITLE>
<BODY>
Si vous etes assis a
cote d'une...

The aligned linearized sequence would be as follows: 5

[START:HTML]     [START:HTML]
[START:TITLE]    [START:TITLE]
[Chunk:12]       [Chunk:15]
[END:TITLE]      [END:TITLE]
[START:BODY]     [START:BODY]
[START:H1]
[Chunk:12]
[END:H1]
[Chunk:112]      [Chunk:122]

4 Source: http://www.legaldatasearch.com/.



3. Threshold the aligned, linearized sequences based on mismatches. When two 
pages are not parallel, there is a high proportion of mismatches in the align- 
ment — sequence tokens on one side that have no corresponding token on 
the other side, such as the tokens associated with the H1 element in the
above example. This can happen, for example, when two documents are 
translations up to a point, e.g. an introduction, but one document goes on 
to include a great deal more content than another. Even more frequently, 
the proportion is high when two documents are prima facie bad candidates 
for a translation pair. For these reasons, candidate pairs whose mismatch 
proportion exceeds a constant, K, are eliminated at this stage. My current 
value for K was set manually at 20% based on experience with a develop- 
ment set, and that value was frozen and used in the experiment described in 
the next section. In that experiment evaluation of STRAND was done using 
a different set of previously unseen documents, for a different language pair 
than the one used during development. 

4. Compute a confidence value. Let $(X, Y) = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be the
lengths for the aligned Chunk tokens in Step 2, such that $x_i$ is not equal to
$y_i$. (When they are exactly equal, this virtually always means the aligned
segments are not natural language text. If included these inflate the correlation
coefficient.) For the above alignment this would be {(12, 15), (112, 122), ...}.
Compute the Pearson correlation coefficient r(X, Y), and compute the
significance of that correlation in textbook fashion. Note that the significance
calculation takes the number n of aligned text segments into account. The
resulting p value is used to threshold significance: using the standard
threshold of p < .05 (i.e. 95% confidence that the correlation would not have been
obtained by chance) worked well during development, and I retained that
threshold in the evaluation described in the section that follows.

5 Note that whitespace is ignored in counting chunk lengths.

Fig. 4. Scatterplots illustrating reliable correlation in lengths of aligned segments for
good translation pairs (left and right), and lack of correlation for a bad pair (center).
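
Steps 1 through 3 can be sketched as follows. This is not the prototype's implementation. It substitutes Python's html.parser for the markup analyzer and difflib.SequenceMatcher for the Unix sdiff utility, and it aligns Chunk tokens on token type alone so that chunks of different length can pair up. The definition of mismatch proportion used here (rows with a token on one side only, divided by all rows) is an assumption, since the text does not spell it out. All function names are hypothetical.

```python
import difflib
from html.parser import HTMLParser

MISMATCH_THRESHOLD = 0.20  # the constant K from Step 3


class Linearizer(HTMLParser):
    """Step 1: turn HTML into START, END and Chunk tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("START", tag.upper()))

    def handle_endtag(self, tag):
        self.tokens.append(("END", tag.upper()))

    def handle_data(self, data):
        length = len("".join(data.split()))  # whitespace is not counted
        if not length:
            return
        if self.tokens and self.tokens[-1][0] == "Chunk":
            # The parser may deliver one text run in pieces; merge them.
            self.tokens[-1] = ("Chunk", self.tokens[-1][1] + length)
        else:
            self.tokens.append(("Chunk", length))


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def alignment_key(token):
    # Chunks compare equal regardless of length (a simplification).
    return ("Chunk",) if token[0] == "Chunk" else token


def align(tokens1, tokens2):
    """Step 2: return rows (t1, t2); None marks a token with no partner."""
    keys1 = [alignment_key(t) for t in tokens1]
    keys2 = [alignment_key(t) for t in tokens2]
    matcher = difflib.SequenceMatcher(None, keys1, keys2, autojunk=False)
    rows = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            rows.extend(zip(tokens1[i1:i2], tokens2[j1:j2]))
        else:
            rows.extend((t, None) for t in tokens1[i1:i2])
            rows.extend((None, t) for t in tokens2[j1:j2])
    return rows


def mismatch_proportion(rows):
    """Step 3: share of rows that have a token on one side only."""
    if not rows:
        return 0.0
    return sum(1 for a, b in rows if a is None or b is None) / len(rows)


def chunk_length_pairs(rows):
    """Input to Step 4: aligned chunk lengths, dropping exactly equal pairs."""
    return [(a[1], b[1]) for a, b in rows
            if a and b and a[0] == "Chunk" and a[1] != b[1]]
```

A candidate pair would be dropped when mismatch_proportion(rows) exceeds MISMATCH_THRESHOLD; otherwise chunk_length_pairs(rows) supplies the data for Step 4. Because this sketch aligns on token type rather than on whole lines, its alignments need not coincide with those sdiff would produce.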

Figure 4 shows plots of (X, Y) for three real candidate pairs. At left is the pair
illustrated in Figure 3, correctly accepted by the candidate evaluation module 
with r = .99, p < .001. At center is a pair correctly rejected by candidate eval- 
uation; in this case r = .24, p > .4, and the mismatch proportion exceeds 75%. 
And at right is another pair correctly accepted; in this more unusual case, the 
correlation is lower (r = .57) but statistically very reliable because of the large 
number of data points (p < .0005). 
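
The interplay between r, n and p in these three cases can be explored with a small sketch. The significance test below is an assumption: it uses the Fisher z transformation with a normal approximation, which may differ from the textbook test actually used in the prototype, although it likewise takes n into account. The function names are hypothetical.

```python
import math


def pearson_r(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs.
    Returns 0.0 when it is undefined (too few points or no variance)."""
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for the hypothesis of no correlation,
    via the Fisher z transformation (an approximation, see lead-in)."""
    if n <= 3:
        return 1.0  # too few points to assess
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def is_significant(pairs, alpha=0.05):
    """Apply the p < .05 threshold from Step 4 to a list of length pairs."""
    return approx_p_value(pearson_r(pairs), len(pairs)) < alpha
```

Under this approximation a moderate correlation becomes reliable once enough aligned segments are available, which is the pattern seen in the right-hand plot.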

Notice that a by-product of this structurally-driven candidate evaluation 
scheme is a set of aligned Chunk tokens. These correspond to aligned non-markup
segments in the document pair. Evaluating the accuracy of this segment-level 
alignment is left for future work. 

2.3 Language-Dependent Filtering 

I have not experimented with further filtering of candidate pairs since, as shown 
in the next section, precision is already quite high. However, experience with 
the small number of false positives I have seen suggests that automatic language 
identification on the remaining candidate pairs might weed out the few that re- 
main. Very high accuracy language identification using character n-gram models 
requires only a modest amount of training text known to be in the languages of 
interest (Dunning, 1994; Grefenstette, 1995). 
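
To make the idea concrete, here is a toy sketch of language identification with character trigram models. It is not one of the cited systems: it uses add-one smoothing over raw trigram counts, and it was not part of the experiment reported below. The function names are hypothetical, and any training text supplied to it is the caller's own.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """samples maps a language name to training text known to be in it."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def identify(text, models, n=3):
    """Return the language whose model gives text the highest log-probability."""
    vocabulary = set()
    for counts in models.values():
        vocabulary.update(counts)
    size = len(vocabulary) + 1  # one extra slot for unseen n-grams

    def log_prob(counts):
        total = sum(counts.values())
        return sum(math.log((counts[g] + 1) / (total + size))
                   for g in char_ngrams(text, n))

    return max(models, key=lambda lang: log_prob(models[lang]))
```

In a filtering module, a candidate pair would be kept only if each page is identified as being in its expected language.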

3 Evaluation 

I developed the STRAND prototype using English and French as the relevant
pair of languages. For evaluation I froze the code and all parameters and ran
the prototype for English and Spanish, not having previously looked at
English/Spanish pairings on the Web.

For the candidate generation phase, I followed the approach of Section 2.1 
and generated candidate document pairs from the first 200 hits returned by the 
Altavista search engine, leading to a set of 198 candidate pairs of URLs that 
met the distance criterion. 

Of those 198 candidate pairs, 12 were pairs where url1 and url2 pointed
to identical pages, and so these were eliminated from consideration. In 96 cases
one or both pages in the pair could not be retrieved (page not found, moved, 
empty, server unreachable, etc.). The remaining 90 cases are considered the set 
of candidate pairs for evaluation. 

I evaluated the 90 candidate pairs by hand, determining that 24 represented 
true translation pairs. 6 The criterion for this determination was the question: 
Was this pair of pages intended to provide the same content in the two different 
languages? Although admittedly subjective, the judgments are generally quite 
clear; I include URLs in an on-line Appendix so that the reader may judge for 
himself or herself. The STRAND prototype's performance against this test set 
was as follows: 

— The candidate evaluation module identified 17 of the 90 candidate pairs as 
true translations, and was correct for 15 of those 17, a precision of 88.2%. (A 
language-dependent filtering module with 100% correct language identifica- 
tion would have eliminated one of the two false positives, giving a precision 
of 93.8%. However, language-dependent filtering was not used in this evalu- 
ation.) 

— The algorithm identified 15 of 24 true translation pairs, a recall of 62.5%. 
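
The figures above follow directly from the counts, as this short sketch shows. The function name precision_recall is hypothetical; the counts are those reported in this section.

```python
def precision_recall(accepted, correct, true_pairs):
    """Precision = correct / accepted; recall = correct / true_pairs."""
    return correct / accepted, correct / true_pairs


# 17 pairs accepted, 15 of them correct, 24 true translation pairs in the test set.
precision, recall = precision_recall(accepted=17, correct=15, true_pairs=24)
print(f"precision {precision:.1%}, recall {recall:.1%}")

# With one false positive removed by perfect language identification:
filtered_precision, _ = precision_recall(accepted=16, correct=15, true_pairs=24)
```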

Manual assessment of the translation pairs retrieved by the algorithm suggests 
that they are representative of what one would expect to find on the Web: the 
pages vary widely in length, content, and the proportion of usable parallel natural 
language text in comparison to markup, graphics, and the like. However, I found 
the yield of genuine parallel text — content in one language and its corresponding 
translation in the other — to be encouraging. The reader may form his or her 
own judgment by looking at the pages identified in the on-line Appendix. 

6 A few of the 90 candidate pairs were encoded in non-HTML format, e.g. PDF
(portable document format). I excluded these from consideration a priori because
STRAND's capabilities are currently limited to HTML.

4 Future Work

At present it is difficult to estimate how many pairs of translated pages may
exist on the World Wide Web. However, it seems fair to say that there are a great
many, and that the number will increase as the Web continues to expand
internationally. The method for candidate generation proposed in this paper makes
it possible to quickly locate candidate pairs without building a Web crawler,
but in principle one could in fact think of the entire set of pages on the Web
as a source for candidate generation. The preliminary figures for recall and
especially for precision suggest that large parallel corpora can be acquired from
the Web with only a relatively small degree of noise, even without human
filtering. Accurate language-dependent filtering (e.g. based on language identification,
as in Section 2.3) would likely increase the precision, reducing noise, without
substantially reducing the recall of useful, true document pairs. In addition to
language-dependent filtering, the following are some areas of investigation for
future work.

— Additional evaluation. As advertised in the title of this paper, the results 
thus far are preliminary. The STRAND approach needs to be evaluated with 
other language pairs, on larger candidate sets, with independent evaluators 
being used in order to accurately estimate an upper bound on the reliability 
of judgments as to whether a candidate pair represents a true translation. 
One could also evaluate how precision varies with recall, but I believe for this 
task there are sufficiently many genuine translation pairs on the Web and 
a sufficiently high recall that the focus should be on maximizing precision. 
Alternative approaches to candidate generation from the Web, as discussed 
in Section 2.1, are a topic for further investigation. 

— Scalability. The prototype, implemented in decidedly non-optimized
fashion using a combination of Perl, C, and shell scripts, currently evaluates
candidate pairs at approximately 1.8 seconds per candidate on a Sun
Ultra 1 workstation with 128 megabytes of real memory, when the pages are
already resident on a disk on the local network (though not local to the
workstation itself). Thus, excluding retrieval time of pages from the Web,
evaluating 1 million retrievable candidate pairs using the existing prototype
would take about 3 weeks of real time. However, STRAND can easily
be run in parallel on an arbitrary number of machines, and the prototype
could be reimplemented in order to obtain significant speed-ups. The main
bottleneck to the approach, the time spent retrieving pages from the Web, is
still trivial if compared to manual construction of corpora. In real use, STRAND would
probably be run as a continuous process, constantly extending the corpus,
so that the cost of retrieval would be amortized over a long period.

— Segment alignment. As discussed in Section 2.2, a by-product of the can- 
didate evaluation module in STRAND is a set of aligned text segments. The 
quality of the segment-level alignment needs to be evaluated, and should be 
compared against alternative alignment algorithms based on the document- 
aligned collection. 

— Additional filtering. Although a primary goal of this work is to obtain a
large, heterogeneous corpus, for some purposes it may be useful to further
filter document pairs. For example, in some applications it might be
important to restrict attention to document pairs that conform to a particular
genre or belong to a particular topic. The STRAND architecture of Figure 1
is clearly amenable to additional filtering modules such as document
classification incorporated into, or pipelined with, the language-dependent filtering
stage.

— Dissemination. Although text out on the Web is generally intended for 
public access, it is nonetheless protected by copyright. Therefore a corpus 
collected using STRAND could not legally be distributed in any straight- 
forward way. However, legal constraints do not prevent multiple sites from 
running their own versions of STRAND, nor any such site from distributing 
a list of URLs for others to retrieve themselves. Anyone implementing this 
or a related approach should be careful to observe protocols governing au- 
tomatic programs and agents on the Web. 7 
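
The running-time estimate in the Scalability item can be reproduced with a few lines of arithmetic. The sketch assumes evaluation time scales linearly with the number of candidates and divides evenly across machines; the function name evaluation_days is hypothetical, and retrieval time is excluded as in the text.

```python
SECONDS_PER_DAY = 86400


def evaluation_days(num_pairs, seconds_per_pair=1.8, machines=1):
    """Days of real time to evaluate num_pairs candidates, assuming the
    work divides evenly over the given number of machines."""
    return num_pairs * seconds_per_pair / machines / SECONDS_PER_DAY


single = evaluation_days(1_000_000)            # one workstation
spread = evaluation_days(1_000_000, machines=10)  # ten workstations in parallel
print(f"{single:.1f} days on one machine, {spread:.1f} days on ten")
```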

The final and most interesting question for future work is: What can one do 
with a parallel corpus drawn from the World Wide Web? I find two possibilities 
particularly promising. First, from a linguistic perspective, such a corpus offers 
opportunities for comparative work in lexical semantics, potentially providing a 
rich database for the cross-linguistic realization of underlying semantic content. 
From the perspective of applications, the corpus is an obvious resource for ac- 
quisition of translation lexicons and distributionally derived representations of 
word meaning. Most interesting of all, each possibility is linked to many others, 
seemingly without end — much like the Web itself. 

Appendix: Experimental Data

At URL http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html the
interested reader can find an on-line Appendix containing the complete test set
described in Section 3, with STRAND's classifications and the author's
judgments.